feat: model and serializer for audio tracks#426

Merged
ceberam merged 22 commits into main from dev/webvtt on Jan 27, 2026

@ceberam (Member) commented Nov 17, 2025

This PR introduces a new type of provenance object for media files.

  • ProvenanceTrack for media tracks
  • Refactoring of other classes for provenance type check
  • Utility classes and methods in a new module on WebVTT for reuse
  • To test the integration with docling, see the draft PR "feat: webvtt and source tracker" (docling#2787)

It also addresses docling-project/docling#2525

@github-actions bot (Contributor) commented Nov 17, 2025

DCO Check Passed

Thanks @ceberam, all your commits are properly signed off. 🎉

@mergify bot commented Nov 17, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@ceberam ceberam force-pushed the dev/webvtt branch 2 times, most recently from 295c5ac to 0c201e2 Compare November 17, 2025 08:43
@codecov bot commented Nov 17, 2025

Codecov Report

❌ Patch coverage is 90.01721% with 58 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling_core/transforms/serializer/webvtt.py | 82.01% | 34 Missing ⚠️ |
| docling_core/types/doc/webvtt.py | 93.16% | 24 Missing ⚠️ |

@ceberam ceberam self-assigned this Nov 27, 2025
@ceberam ceberam force-pushed the dev/webvtt branch 2 times, most recently from 9e26200 to a0f4404 Compare December 4, 2025 10:02
@ceberam ceberam added the enhancement New feature or request label Dec 4, 2025
@ceberam ceberam force-pushed the dev/webvtt branch 3 times, most recently from c2efd15 to 4ecde67 Compare December 5, 2025 14:20
@ceberam ceberam marked this pull request as ready for review December 15, 2025 08:39
@dosubot bot commented Dec 15, 2025

Related Documentation

Checked 7 published document(s) in 0 knowledge base(s). No updates required.

@ceberam ceberam force-pushed the dev/webvtt branch 2 times, most recently from c56d14a to c56a691 Compare January 6, 2026 15:15
@ceberam ceberam force-pushed the dev/webvtt branch 4 times, most recently from 19a2dbe to 5e0a787 Compare January 23, 2026 15:26
@PeterStaar-IBM (Member) previously approved these changes Jan 26, 2026 and left a comment

Nice!

@vagenas (Member) left a comment

  1. Align new union naming to differentiate from "legacy" prov (e.g. call it SourceType?), also aligned with new field
  2. Unroll the tags field into the individual fields we consider relevant for now, e.g. "voice" / "id" (I don't think we need "classes" now, as it's quite VTT-specific). The fact that we would be losing some information on VTT is actually consistent with other import/export paths (e.g. HTML with embedded CSS).
  3. validate upon assignment could be considered outside this PR
  4. DoclingDocument.add_text() can be extended with a new optional param for source

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…le types

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Since WebVTTTimestamp is used in DoclingDocument, the class should be public.
Strengthen validation of cue language start tag annotation.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
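
To illustrate the kind of check a public timestamp type has to perform, here is a minimal sketch of WebVTT timestamp parsing. It is not the `WebVTTTimestamp` implementation from this PR: the function name `parse_timestamp` and the regular expression are assumptions made for illustration, based on the `mm:ss.ttt` / `hh:mm:ss.ttt` syntax in the WebVTT specification.

```python
import re

# WebVTT timestamps are "mm:ss.ttt" or "hh:mm:ss.ttt"; the hours part is
# optional and has at least two digits. (Hypothetical helper, not the PR's code.)
_TIMESTAMP = re.compile(r"(?:([0-9]{2,}):)?([0-5][0-9]):([0-5][0-9])\.([0-9]{3})")


def parse_timestamp(raw: str) -> float:
    """Return the timestamp in seconds, or raise ValueError if it is malformed."""
    match = _TIMESTAMP.fullmatch(raw)
    if match is None:
        raise ValueError(f"invalid WebVTT timestamp: {raw!r}")
    hours, minutes, seconds, millis = match.groups()
    return (
        int(hours or 0) * 3600
        + int(minutes) * 60
        + int(seconds)
        + int(millis) / 1000
    )
```

With this sketch, `parse_timestamp("00:01:02.500")` and `parse_timestamp("01:02.500")` both yield `62.5`, while a string such as `"1:02.5"` is rejected.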
Add a DoclingDocument serializer to WebVTT format.
Improve WebVTT data model.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Add 'text/vtt' as an extra MIME type to support WebVTT serialization, since it is not
supported by 'mimetypes' with Python < 3.11

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
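
The commit above can be pictured with the standard library's `mimetypes.add_type`. This is only a sketch of the general technique, not necessarily how docling-core registers the type; the helper name `guess_mime` is an assumption.

```python
import mimetypes
from typing import Optional

# Register the WebVTT MIME type explicitly so that lookups do not depend on
# the interpreter's built-in table.
mimetypes.add_type("text/vtt", ".vtt")


def guess_mime(filename: str) -> Optional[str]:
    """Return the MIME type guessed from the file extension, if any."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime
```

After the registration, `guess_mime("captions.vtt")` resolves to `"text/vtt"` regardless of the Python version in use.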
Classes and fields that are related to the new source type should align with their names.
The term 'provenance' will identify the legacy implementation.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Drop the validation on field assignment in NodeItem objects.
Add the 'source' argument to the convenience function 'add_text' to create a TextItem with track source data.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

refactor(webvtt): drop cue span classes, 'lang' and 'c' tags

Drop WebVTT formatting features not covered by Docling across formats.
Only 'u', 'b', 'i', and 'v' are supported and without classes.
Make 'v' tag explicit as 'voice' feature in SourceTrack class.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
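
The effect of this refactoring on cue text can be sketched as follows. This is an illustrative approximation, not the code from this PR: the function name `simplify_cue_text`, the `SUPPORTED` set, and the regular expression are assumptions. It keeps only the `u`, `b`, `i`, and `v` tags, drops their classes, and removes any other named tag while keeping the enclosed text.

```python
import re

# Tags kept after the refactoring, per the commit message above.
SUPPORTED = {"u", "b", "i", "v"}

# Matches "<tag.class1.class2 annotation>" and "</tag>" (hypothetical pattern).
_TAG = re.compile(r"<(/?)([A-Za-z]+)(?:\.[^\s.>]+)*(\s[^>]*)?>")


def simplify_cue_text(text: str) -> str:
    """Drop unsupported cue span tags and all classes, keeping the inner text."""

    def _rewrite(match):
        closing, name = match.group(1), match.group(2)
        annotation = (match.group(3) or "").strip()
        if name not in SUPPORTED:
            return ""  # e.g. <c>, <lang>: the tag goes, its content stays
        if closing:
            return f"</{name}>"
        if name == "v" and annotation:
            return f"<v {annotation}>"  # the voice name is the annotation
        return f"<{name}>"

    return _TAG.sub(_rewrite, text)
```

For example, `simplify_cue_text("<v.loud Mary>Hi <c.red>there</c></v>")` gives `<v Mary>Hi there</v>`: the result is still well-formed cue text, but the class and the `c` span are gone.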
@ceberam (Member, Author) commented Jan 26, 2026

Thanks @vagenas for the review.
I have pushed some commits addressing the points you raised.

> 1. Align new union naming to differentiate from "legacy" prov (e.g. call it SourceType?), also aligned with new field

OK ✅

> 2. Unroll the `tags` field into the individual fields we consider relevant for now, e.g. "voice" / "id" (I don't think we need "classes" now, as it's quite VTT-specific). The fact that we would be losing some information on VTT is actually consistent with other import/export paths (e.g. HTML with embedded CSS).

⚠️ OK, but we need to be aware that we no longer support all the WebVTT cue span features, and therefore the cycle

WebVTT file --> (WebVTTDocumentBackend) --> DoclingDocument --> (WebVTTDocSerializer) --> WebVTT file

may now imply loss of information, even though the resulting file will still be a valid WebVTT according to the specs.

> 3. validate upon assignment could be considered outside this PR

⚠️ OK, but let's see later whether we want to activate it as a safeguard. I have seen some patterns in docling-core where fields are assigned after a model is instantiated.

> 4. `DoclingDocument.add_text()` can be extended with a new optional param for `source`

OK ✅

@cau-git (Member) left a comment

LGTM

@ceberam ceberam merged commit c8f3c01 into main Jan 27, 2026
12 checks passed
@ceberam ceberam deleted the dev/webvtt branch January 27, 2026 13:45